Ensemble Techniques - Travel Package Purchase Prediction

Description

Background and Context

  • You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

  • A viable business model is a central concept that helps you understand the existing ways of doing business and how to change them for the benefit of the tourism sector.

  • One of the ways to expand the customer base is to introduce a new offering of packages.

  • Currently, the company offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of the customers purchased a package.

  • However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

  • The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

  • However, this time the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

  • You, as a Data Scientist at the "Visit with us" travel company, have to analyze the customers' data to provide recommendations to the Policy Maker and Marketing Team, and also build a model to predict which potential customers are likely to purchase the newly introduced package.

Objective

  • To predict which customer is more likely to purchase the newly introduced travel package.

Data Dictionary

Customer details:

  • CustomerID: Unique customer ID
  • ProdTaken: Product taken flag (target variable)
  • Age: Age of the customer
  • PreferredLoginDevice: Preferred login device of the customer in the last month
  • CityTier: City tier
  • Occupation: Occupation of the customer
  • Gender: Gender of the customer
  • NumberOfPersonVisited: Total number of persons who came with the customer
  • PreferredPropertyStar: Preferred hotel property rating of the customer
  • MaritalStatus: Marital status of the customer
  • NumberOfTrips: Average number of trips per year by the customer
  • Passport: Customer passport flag
  • OwnCar: Customer owns a car flag
  • NumberOfChildrenVisited: Total number of children who visited with the customer
  • Designation: Designation of the customer in the current organization
  • MonthlyIncome: Gross monthly income of the customer

Customer interaction data:

  • PitchSatisfactionScore: Sales pitch satisfaction score
  • ProductPitched: Product pitched by the salesperson
  • NumberOfFollowups: Total number of follow-ups by the salesperson after the sales pitch
  • DurationOfPitch: Duration of the pitch by the salesperson to the customer

Criteria Perform an Exploratory Data Analysis on the data

  • Univariate analysis - Bivariate analysis - Use appropriate visualizations to identify the patterns and insights - Come up with a customer profile (characteristics of a customer) of the different packages - Any other exploratory deep dive

Points : 7.5

Criteria Illustrate the insights based on EDA

Key meaningful observations on the relationship between variables

Points : 5

Criteria Data Pre-processing

Prepare the data for analysis - Missing value treatment, Outlier detection (treat if needed - why or why not), Feature engineering, Prepare data for modeling

Points : 7.5

Criteria Model building - Bagging

  • Build bagging classifier, random forest and decision tree.

Points : 6

Criteria Model performance evaluation and improvement

  • Comment on which metric is right for model performance evaluation and why? - Comment on model performance - Can model performance be improved? check and comment

Points : 9

Criteria Model building - Boosting

  • Build Adaboost, gradient boost, xgboost and stacking classifier

Points : 8

Criteria Model performance evaluation and improvement

  • Comment on which metric is right for model performance evaluation and why? - Comment on model performance - Can model performance be improved? check and comment

Points : 12

Criteria Actionable Insights & Recommendations

  • Compare models - Business recommendations and insights

Points : 5

Total Points : 60

Let's start by importing libraries we need.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np   
import pandas as pd    
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, train_test_split
In [2]:
# Loading the dataset; the workbook has two sheets - the data dictionary and the data itself
xls = pd.ExcelFile('Tourism.xlsx')
data_dictionary = pd.read_excel(xls, 'Data Dict')
data = pd.read_excel(xls, 'Tourism')

View the first 5 rows of the dataset.

In [3]:
data.head()
Out[3]:
CustomerID ProdTaken Age PreferredLoginDevice CityTier DurationOfPitch Occupation Gender NumberOfPersonVisited NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisited Designation MonthlyIncome
0 200000 1 41.0 Self Enquiry 3 6.0 Salaried Female 3 3.0 Super Deluxe 3.0 Single 1.0 1 2 1 0.0 Manager 20993.0
1 200001 0 49.0 Company Invited 1 14.0 Salaried Male 3 4.0 Super Deluxe 4.0 Divorced 2.0 0 3 1 2.0 Manager 20130.0
2 200002 1 37.0 Self Enquiry 1 8.0 Free Lancer Male 3 4.0 Multi 3.0 Single 7.0 1 3 0 0.0 Executive 17090.0
3 200003 0 33.0 Company Invited 1 9.0 Salaried Female 2 3.0 Multi 3.0 Divorced 2.0 1 5 1 1.0 Executive 17909.0
4 200004 0 NaN Self Enquiry 1 8.0 Small Business Male 2 3.0 Multi 4.0 Divorced 1.0 0 5 1 0.0 Executive 18468.0

Check data types and number of non-null values for each column.

In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CustomerID               4888 non-null   int64  
 1   ProdTaken                4888 non-null   int64  
 2   Age                      4662 non-null   float64
 3   PreferredLoginDevice     4863 non-null   object 
 4   CityTier                 4888 non-null   int64  
 5   DurationOfPitch          4637 non-null   float64
 6   Occupation               4888 non-null   object 
 7   Gender                   4888 non-null   object 
 8   NumberOfPersonVisited    4888 non-null   int64  
 9   NumberOfFollowups        4843 non-null   float64
 10  ProductPitched           4888 non-null   object 
 11  PreferredPropertyStar    4862 non-null   float64
 12  MaritalStatus            4888 non-null   object 
 13  NumberOfTrips            4748 non-null   float64
 14  Passport                 4888 non-null   int64  
 15  PitchSatisfactionScore   4888 non-null   int64  
 16  OwnCar                   4888 non-null   int64  
 17  NumberOfChildrenVisited  4822 non-null   float64
 18  Designation              4888 non-null   object 
 19  MonthlyIncome            4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
In [5]:
data["PreferredLoginDevice"] = data["PreferredLoginDevice"].astype("category")
data["Occupation"] = data["Occupation"].astype("category")
data["Gender"] = data["Gender"].astype("category")
data["ProductPitched"] = data["ProductPitched"].astype("category")
data["MaritalStatus"] = data["MaritalStatus"].astype("category")
data["Designation"] = data["Designation"].astype("category")

Observation

  • We can see that there are a total of 20 columns and 4,888 rows in the dataset.
  • 14 columns are numeric (7 int64 and 7 float64); the remaining 6 columns - PreferredLoginDevice, Occupation, Gender, ProductPitched, MaritalStatus, Designation - are of object type, which we converted to category above.
  • Several columns have fewer non-null values than the total number of rows, i.e., there are missing values. We can confirm this using the isna() method.
In [6]:
data.isna().sum()
Out[6]:
CustomerID                   0
ProdTaken                    0
Age                        226
PreferredLoginDevice        25
CityTier                     0
DurationOfPitch            251
Occupation                   0
Gender                       0
NumberOfPersonVisited        0
NumberOfFollowups           45
ProductPitched               0
PreferredPropertyStar       26
MaritalStatus                0
NumberOfTrips              140
Passport                     0
PitchSatisfactionScore       0
OwnCar                       0
NumberOfChildrenVisited     66
Designation                  0
MonthlyIncome              233
dtype: int64

Observation

  • Age, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisited, and MonthlyIncome have null values; PreferredLoginDevice also has 25 nulls
In [7]:
data.median()
Out[7]:
CustomerID                 202443.5
ProdTaken                       0.0
Age                            36.0
CityTier                        1.0
DurationOfPitch                13.0
NumberOfPersonVisited           3.0
NumberOfFollowups               4.0
PreferredPropertyStar           3.0
NumberOfTrips                   3.0
Passport                        0.0
PitchSatisfactionScore          3.0
OwnCar                          1.0
NumberOfChildrenVisited         1.0
MonthlyIncome               22347.0
dtype: float64
In [8]:
# Replace missing values in numeric columns with each column's median.
# data.median() returns a per-column Series, so fillna aligns it column-wise;
# note that categorical columns (e.g. PreferredLoginDevice) are left untouched.
data = data.fillna(data.median())
In [9]:
data.isna().sum()
Out[9]:
CustomerID                  0
ProdTaken                   0
Age                         0
PreferredLoginDevice       25
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisited       0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisited     0
Designation                 0
MonthlyIncome               0
dtype: int64
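Note that data.median() only covers numeric columns, which is why PreferredLoginDevice still shows 25 missing values above. One common option is to fill categorical columns with their mode; a minimal sketch on a toy frame (not the notebook's data):

```python
import pandas as pd

def fill_categorical_na(df, columns):
    """Fill missing values in the given categorical columns with each column's mode."""
    out = df.copy()
    for col in columns:
        mode_value = out[col].mode()[0]  # most frequent category
        out[col] = out[col].fillna(mode_value)
    return out

# Toy example mirroring the PreferredLoginDevice column
toy = pd.DataFrame({"device": pd.Categorical(
    ["Self Enquiry", None, "Self Enquiry", "Company Invited"])})
toy = fill_categorical_na(toy, ["device"])
```

Whether mode imputation is appropriate here depends on how much the 25 missing rows matter; dropping them is another defensible choice.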
In [10]:
data.dtypes
Out[10]:
CustomerID                    int64
ProdTaken                     int64
Age                         float64
PreferredLoginDevice       category
CityTier                      int64
DurationOfPitch             float64
Occupation                 category
Gender                     category
NumberOfPersonVisited         int64
NumberOfFollowups           float64
ProductPitched             category
PreferredPropertyStar       float64
MaritalStatus              category
NumberOfTrips               float64
Passport                      int64
PitchSatisfactionScore        int64
OwnCar                        int64
NumberOfChildrenVisited     float64
Designation                category
MonthlyIncome               float64
dtype: object

Summary of the dataset

In [11]:
# Summary of continuous columns
data.describe().T
Out[11]:
count mean std min 25% 50% 75% max
CustomerID 4888.0 202443.500000 1411.188388 200000.0 201221.75 202443.5 203665.25 204887.0
ProdTaken 4888.0 0.188216 0.390925 0.0 0.00 0.0 0.00 1.0
Age 4888.0 37.547259 9.104795 18.0 31.00 36.0 43.00 61.0
CityTier 4888.0 1.654255 0.916583 1.0 1.00 1.0 3.00 3.0
DurationOfPitch 4888.0 15.362930 8.316166 5.0 9.00 13.0 19.00 127.0
NumberOfPersonVisited 4888.0 2.905074 0.724891 1.0 2.00 3.0 3.00 5.0
NumberOfFollowups 4888.0 3.711129 0.998271 1.0 3.00 4.0 4.00 6.0
PreferredPropertyStar 4888.0 3.577946 0.797005 3.0 3.00 3.0 4.00 5.0
NumberOfTrips 4888.0 3.229746 1.822769 1.0 2.00 3.0 4.00 22.0
Passport 4888.0 0.290917 0.454232 0.0 0.00 0.0 1.00 1.0
PitchSatisfactionScore 4888.0 3.078151 1.365792 1.0 2.00 3.0 4.00 5.0
OwnCar 4888.0 0.620295 0.485363 0.0 0.00 1.0 1.00 1.0
NumberOfChildrenVisited 4888.0 1.184738 0.852323 0.0 1.00 1.0 2.00 3.0
MonthlyIncome 4888.0 23559.179419 5257.862921 1000.0 20485.00 22347.0 25424.75 98678.0
  • DurationOfPitch, NumberOfTrips, and MonthlyIncome have outliers - their maximum values (127, 22, and 98,678) are far above their 75th percentiles.
  • The target variable ProdTaken is imbalanced: its mean of 0.188 shows that only about 19% of customers purchased a package. We will explore this further.
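The 1.5*IQR rule used by box plots makes the outlier claim concrete. A minimal sketch on toy values loosely echoing DurationOfPitch (not the real column):

```python
import pandas as pd

def iqr_outlier_bounds(series):
    """Return (lower, upper) whisker bounds using the 1.5*IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

s = pd.Series([5, 9, 13, 19, 127])       # illustrative values only
low, high = iqr_outlier_bounds(s)
outliers = s[(s < low) | (s > high)]      # here, only 127 falls outside
```

The same function can be applied per column to decide whether to cap, transform, or keep the extreme values.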

Univariate and Bivariate Analysis

To do - identify insights, if any, from the distributions.

Number of unique values in each column

In [12]:
data.shape
Out[12]:
(4888, 20)
In [13]:
data.nunique()
Out[13]:
CustomerID                 4888
ProdTaken                     2
Age                          44
PreferredLoginDevice          2
CityTier                      3
DurationOfPitch              34
Occupation                    4
Gender                        3
NumberOfPersonVisited         5
NumberOfFollowups             6
ProductPitched                5
PreferredPropertyStar         3
MaritalStatus                 4
NumberOfTrips                12
Passport                      2
PitchSatisfactionScore        5
OwnCar                        2
NumberOfChildrenVisited       4
Designation                   5
MonthlyIncome              2475
dtype: int64
  • We can drop the 'CustomerID' column as it is an ID variable (4,888 unique values) and will not add value to the model.

Number of observations in each category

In [15]:
cat_cols=['ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation']

for column in cat_cols:
    print('--**'*10)
    print(data[column].value_counts())
    print('**--'*10)
--**--**--**--**--**--**--**--**--**--**
0    3968
1     920
Name: ProdTaken, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
36.0    457
35.0    237
34.0    211
31.0    203
30.0    199
32.0    197
33.0    189
37.0    185
29.0    178
38.0    176
41.0    155
39.0    150
28.0    147
40.0    146
42.0    142
27.0    138
43.0    130
46.0    121
45.0    116
26.0    106
44.0    105
51.0     90
47.0     88
50.0     86
25.0     74
52.0     68
53.0     66
48.0     65
49.0     65
55.0     64
54.0     61
56.0     58
24.0     56
22.0     46
23.0     46
59.0     44
21.0     41
20.0     38
19.0     32
58.0     31
60.0     29
57.0     29
18.0     14
61.0      9
Name: Age, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Self Enquiry       3444
Company Invited    1419
Name: PreferredLoginDevice, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
1    3190
3    1500
2     198
Name: CityTier, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
9.0      483
13.0     474
7.0      342
8.0      333
6.0      307
16.0     274
15.0     269
14.0     253
10.0     244
11.0     205
12.0     195
17.0     172
30.0      95
22.0      89
31.0      83
23.0      79
18.0      75
32.0      74
29.0      74
21.0      73
25.0      73
27.0      72
26.0      72
24.0      70
35.0      66
20.0      65
28.0      61
33.0      57
19.0      57
34.0      50
36.0      44
5.0        6
126.0      1
127.0      1
Name: DurationOfPitch, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: Occupation, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Male       2916
Female     1817
Fe Male     155
Name: Gender, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
3    2402
2    1418
4    1026
1      39
5       3
Name: NumberOfPersonVisited, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
4.0    2113
3.0    1466
5.0     768
2.0     229
1.0     176
6.0     136
Name: NumberOfFollowups, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Multi           1842
Super Deluxe    1732
Standard         742
Deluxe           342
King             230
Name: ProductPitched, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
3.0    3019
5.0     956
4.0     913
Name: PreferredPropertyStar, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: MaritalStatus, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
2.0     1464
3.0     1219
1.0      620
4.0      478
5.0      458
6.0      322
7.0      218
8.0      105
21.0       1
19.0       1
22.0       1
20.0       1
Name: NumberOfTrips, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
0    3466
1    1422
Name: Passport, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
3    1478
5     970
1     942
4     912
2     586
Name: PitchSatisfactionScore, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
1    3032
0    1856
Name: OwnCar, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
1.0    2146
2.0    1335
0.0    1082
3.0     325
Name: NumberOfChildrenVisited, dtype: int64
**--**--**--**--**--**--**--**--**--**--
--**--**--**--**--**--**--**--**--**--**
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: Designation, dtype: int64
**--**--**--**--**--**--**--**--**--**--

Summary & Observation

  • Product taken flag count: 0 - 3968 & 1 - 920

  • Preferred login device of customer in the last month: Self Enquiry - 3444, Company Invited - 1419

    • There are more self-enquiry customers than company-invited ones
  • City tier: (1 - 3190, 2 - 198, 3 - 1500)

    • Most customers are from tier 1 cities
  • Occupation of customer: (Salaried - 2368, Small Business - 2084, Large Business - 434, Free Lancer - 2)

    • Most customers are either Salaried or run a Small Business
  • Gender of customer: (Male - 2916, Female - 1817, Fe Male - 155)

    • More than 50% of the customers are male
    • 155 records have the value "Fe Male" - this is a data entry issue, so we need to update "Fe Male" to "Female"
  • Total number of persons who came with the customer: (1 - 39, 2 - 1418, 3 - 2402, 4 - 1026, 5 - 3)

    • Very few customers came with 1 or 5 persons
  • Total number of follow-ups by the salesperson after the sales pitch: (1.0 - 176, 2.0 - 229, 3.0 - 1466, 4.0 - 2113, 5.0 - 768, 6.0 - 136)

    • Most customers received either 3 or 4 follow-ups
  • Product pitched by the salesperson: (Multi - 1842, Super Deluxe - 1732, Standard - 742, Deluxe - 342, King - 230)

    • Multi and Super Deluxe were pitched to the majority of customers
  • Preferred hotel property rating by customer: (3.0 - 3019, 4.0 - 913, 5.0 - 956)

    • The majority of customers would like to stay in a 3-star property
  • Marital status of customer: (Married - 2340, Divorced - 950, Single - 916, Unmarried - 682)

    • The majority of customers are married
  • Average number of trips per year by customer: (1.0 - 620, 2.0 - 1464, 3.0 - 1219, 4.0 - 478, 5.0 - 458, 6.0 - 322, 7.0 - 218, 8.0 - 105, 19.0 - 1, 20.0 - 1, 21.0 - 1, 22.0 - 1)

    • 2 and 3 trips per year are the most common; values of 19 to 22 are rare extremes
  • Customer passport flag: (0 - 3466, 1 - 1422)

    • Fewer customers hold a passport
  • Sales pitch satisfaction score: (1 - 942, 2 - 586, 3 - 1478, 4 - 912, 5 - 970)

    • Fairly evenly distributed, with a score of 3 being the most common
  • Customer owns a car flag: (1 - 3032, 0 - 1856)

    • More customers own a car, though the difference is modest
  • Total number of children who visited with the customer: (0.0 - 1082, 1.0 - 2146, 2.0 - 1335, 3.0 - 325)

    • Most customers visited with 0, 1, or 2 children
  • Designation of customer in current organization: (Executive - 1842, Manager - 1732, Senior Manager - 742, AVP - 342, VP - 230)

    • Executives and Managers form the majority
  • Age and DurationOfPitch are continuous variables; their distributions are examined below

Let's update the value of Gender from "Fe Male" to "Female".

In [16]:
# Fix the data entry issue: replace 'Fe Male' with 'Female'
# ('Female' is already a registered category, so direct assignment works)
data.loc[data.Gender == 'Fe Male', 'Gender'] = 'Female'
In [17]:
print(data['Gender'].value_counts())
Male       2916
Female     1972
Fe Male       0
Name: Gender, dtype: int64
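The "Fe Male" row still appears with a count of 0 because the column is a pandas category and the unused label remains registered. A minimal sketch, on a toy Series rather than the notebook's data, of dropping it with remove_unused_categories:

```python
import pandas as pd

# Toy categorical Series mirroring the Gender column before the fix
gender = pd.Series(pd.Categorical(["Male", "Fe Male", "Female", "Male"]))

# Apply the same correction as above; "Female" is an existing category
gender[gender == "Fe Male"] = "Female"

# Drop the now-empty "Fe Male" label so value_counts() no longer lists it
gender = gender.cat.remove_unused_categories()
```

After this, value_counts() shows only the categories that actually occur.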

Univariate analysis

Histogram and box plots for all features, after the above corrections

In [18]:
sns.histplot(data['Age'])
Out[18]:
<AxesSubplot:xlabel='Age', ylabel='Count'>
  • Most customers are around 35-36 years of age
In [19]:
sns.histplot(data['CityTier'])
Out[19]:
<AxesSubplot:xlabel='CityTier', ylabel='Count'>
  • Most customers are from tier 1 cities
In [20]:
sns.histplot(data['DurationOfPitch'])
Out[20]:
<AxesSubplot:xlabel='DurationOfPitch', ylabel='Count'>
  • The pitch duration is below 40 for most customers; the most frequent durations (counts above 450) are under 10
In [21]:
sns.histplot(data['Occupation'])
Out[21]:
<AxesSubplot:xlabel='Occupation', ylabel='Count'>
  • Most customers are either Salaried or Small Business
  • Large Business customers are far fewer than Salaried or Small Business customers
In [22]:
sns.histplot(data['Gender'])
Out[22]:
<AxesSubplot:xlabel='Gender', ylabel='Count'>
  • There are more than 2,500 male customers
  • There are just under 2,000 female customers
In [23]:
sns.histplot(data['NumberOfPersonVisited'])
Out[23]:
<AxesSubplot:xlabel='NumberOfPersonVisited', ylabel='Count'>
  • Over 2,000 customers came with 3 persons
  • Over 1,000 customers came with 2 persons
  • Fewer than 1,000 customers came with 4 persons
  • Very few customers came with 1 or 5 persons
In [24]:
sns.histplot(data['NumberOfFollowups'])
Out[24]:
<AxesSubplot:xlabel='NumberOfFollowups', ylabel='Count'>
  • 4 follow-ups were done for over 2,000 customers
  • 3 follow-ups were done for over 1,200 customers
  • 5 follow-ups were done for over 750 customers
  • 1, 2, and 6 follow-ups are much less common
In [25]:
sns.histplot(data['ProductPitched'])
Out[25]:
<AxesSubplot:xlabel='ProductPitched', ylabel='Count'>
  • The "Multi" product was pitched over 1,750 times by salespersons
  • The "Super Deluxe" product was pitched slightly fewer than 1,750 times
  • Each of the other products was pitched fewer than 750 times
In [26]:
sns.histplot(data['PreferredPropertyStar'])
Out[26]:
<AxesSubplot:xlabel='PreferredPropertyStar', ylabel='Count'>
  • Preferred hotel property rating by customer
    • 3-star: around 3,000 customers
    • 4-star and 5-star: fewer than 1,000 each
In [27]:
sns.histplot(data['MaritalStatus'])
Out[27]:
<AxesSubplot:xlabel='MaritalStatus', ylabel='Count'>
  • Marital status of customer
    • Married: over 2,000
    • Single, Divorced, and Unmarried: fewer than 1,000 each
In [28]:
sns.histplot(data['NumberOfTrips'])
Out[28]:
<AxesSubplot:xlabel='NumberOfTrips', ylabel='Count'>
  • Average number of trips per year by customer
    • 2 trips: over 1,400 customers
    • 3 trips: over 1,200 customers
    • 1 trip: around 600 customers
    • Others: fewer than 500 each
In [29]:
sns.histplot(data['PitchSatisfactionScore'])
Out[29]:
<AxesSubplot:xlabel='PitchSatisfactionScore', ylabel='Count'>
  • Sales pitch satisfaction score
    • A score of 3 was given by more than 1,400 customers
    • Each of the other scores was given by fewer than 1,000 customers
In [30]:
sns.histplot(data['NumberOfChildrenVisited'])
Out[30]:
<AxesSubplot:xlabel='NumberOfChildrenVisited', ylabel='Count'>
  • Total number of children who visited with the customer
    • Over 2,000 customers visited with 1 child
    • Over 1,000 customers visited with 2 children
    • Around 1,000 customers visited with no children
In [31]:
sns.histplot(data['Designation'])
Out[31]:
<AxesSubplot:xlabel='Designation', ylabel='Count'>
  • Designation of customer in current organization
    • Executives: over 1,750
    • Managers: slightly fewer than 1,750
In [32]:
sns.boxplot(data['Age'])
Out[32]:
<AxesSubplot:xlabel='Age'>
  • The middle 50% of customer ages lies roughly between 31 and 43
  • No outliers
In [33]:
sns.boxplot(data['DurationOfPitch'])
Out[33]:
<AxesSubplot:xlabel='DurationOfPitch'>
  • Duration of the pitch by the salesperson to the customer
    • Has outliers on the high end
In [34]:
sns.boxplot(data['NumberOfPersonVisited'])
Out[34]:
<AxesSubplot:xlabel='NumberOfPersonVisited'>
  • Total number of persons who came with the customer - the value 5 appears as an outlier
In [35]:
sns.boxplot(data['NumberOfFollowups'])
Out[35]:
<AxesSubplot:xlabel='NumberOfFollowups'>
  • Total number of follow-ups by the salesperson after the sales pitch
    • Has outliers on both sides
In [36]:
sns.boxplot(data['PreferredPropertyStar'])
Out[36]:
<AxesSubplot:xlabel='PreferredPropertyStar'>
  • Preferred hotel property rating by customer
    • No outliers
In [37]:
sns.boxplot(data['NumberOfTrips'])
Out[37]:
<AxesSubplot:xlabel='NumberOfTrips'>
  • Average number of trips per year by customer: has outliers on the high end
In [38]:
#Top 5 highest values 
data['Age'].nlargest()
Out[38]:
2855    61.0
2871    61.0
2980    61.0
3323    61.0
3653    61.0
Name: Age, dtype: float64
  • The maximum customer age is 61 (nine customers in total)

Function to create barplots for each feature

In [39]:
cat_cols=['ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation']

for column in cat_cols:
    plt.figure(figsize = (20,15))
    sns.countplot(x=column, data=data)
    plt.show()

Bivariate analysis

In [40]:
cat_cols=['ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation']

for column in cat_cols:
    sns.set(rc={'figure.figsize':(21,7)})
    sns.catplot(x=column, y="Age", kind="swarm", data=data, height=7, aspect=3);
In [41]:
cat_cols=['ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation']

for column in cat_cols:
    sns.set(rc={'figure.figsize':(21,7)})
    sns.catplot(x=column, y="Age", data=data, kind='bar', height=6, aspect=1.5, estimator=np.mean);
In [42]:
sns.pairplot(
    data,
    height=4,
    aspect=1
    );
In [43]:
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(data.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")
plt.show()
In [55]:
# Separating out the categorical columns
# Note: X below still contains the target ProdTaken and the ID column CustomerID,
# and y is a subset of numeric columns rather than the single target.
# For the stated classification objective, y should be data['ProdTaken'], with
# ProdTaken and CustomerID dropped from X.
X = data.drop(['PreferredLoginDevice','Occupation','Gender','ProductPitched','MaritalStatus','Designation'], axis=1)
y = data[['ProdTaken','Age','CityTier','DurationOfPitch','NumberOfPersonVisited','NumberOfFollowups','Passport','PitchSatisfactionScore','OwnCar','NumberOfChildrenVisited']]
In [56]:
# one-hot encode any remaining categorical variables
# (a no-op here, since all category columns were already dropped from X)
x = pd.get_dummies(X, drop_first=True)
x.head()
Out[56]:
CustomerID ProdTaken Age CityTier DurationOfPitch NumberOfPersonVisited NumberOfFollowups PreferredPropertyStar NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisited MonthlyIncome
0 200000 1 41.0 3 6.0 3 3.0 3.0 1.0 1 2 1 0.0 20993.0
1 200001 0 49.0 1 14.0 3 4.0 4.0 2.0 0 3 1 2.0 20130.0
2 200002 1 37.0 1 8.0 3 4.0 3.0 7.0 1 3 0 0.0 17090.0
3 200003 0 33.0 1 9.0 2 3.0 3.0 2.0 1 5 1 1.0 17909.0
4 200004 0 36.0 1 8.0 2 3.0 4.0 1.0 0 5 1 0.0 18468.0
In [57]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, shuffle=True)
In [58]:
X_train.shape, X_test.shape
Out[58]:
((3421, 14), (1467, 14))
  • We have 3,421 observations in the train set and 1,467 observations in the test set.
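The split above is unstratified. Since ProdTaken is imbalanced (roughly 19% positives), a stratified split keeps the class ratio equal in train and test sets; a sketch on synthetic data (the frame and values here are illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature plus an ~19%-positive binary target
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "ProdTaken": (rng.random(1000) < 0.19).astype(int),
})

# stratify preserves the class proportion in both splits
X_t, X_v, y_t, y_v = train_test_split(
    toy[["feature"]], toy["ProdTaken"],
    test_size=0.30, random_state=1, stratify=toy["ProdTaken"])
```

Without stratification, a small split can end up with a noticeably different positive rate than the full data.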

Building Models

  • We'll fit different models on the train data and observe their performance.
  • We'll try to improve that performance by tuning some hyperparameters available for that algorithm.
  • We'll use GridSearchCV for hyperparameter tuning and the R² score to optimize the model.
  • R-squared (the coefficient of determination) is used to evaluate the performance of a regression model. It is the proportion of the variation in the dependent variable that is predictable from the independent variables.
  • Let's start by creating a function to get model scores, so that we don't have to repeat the same code.
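As a quick sanity check of the definition above, R² = 1 - SS_res / SS_tot, and the manual computation should match sklearn's metrics.r2_score (toy numbers, purely illustrative):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot                  # 0.975 for these values
```

This makes explicit why R² = 1 means perfect prediction and R² = 0 means the model does no better than predicting the mean.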
In [59]:
##  Function to calculate r2_score and RMSE on train and test data
def get_model_score(model, flag=True):
    '''
    model : fitted regressor used to predict on X_train and X_test

    '''
    # defining an empty list to store train and test results
    score_list=[] 
    
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    
    train_r2=metrics.r2_score(y_train,pred_train)
    test_r2=metrics.r2_score(y_test,pred_test)
    train_rmse=np.sqrt(metrics.mean_squared_error(y_train,pred_train))
    test_rmse=np.sqrt(metrics.mean_squared_error(y_test,pred_test))
    
    #Adding all scores in the list
    score_list.extend((train_r2,test_r2,train_rmse,test_rmse))
    
    # If the flag is set to True, the following print statements will be displayed; the default value is True
    if flag==True: 
        print("R-square on training set : ",metrics.r2_score(y_train,pred_train))
        print("R-square on test set : ",metrics.r2_score(y_test,pred_test))
        print("RMSE on training set : ",np.sqrt(metrics.mean_squared_error(y_train,pred_train)))
        print("RMSE on test set : ",np.sqrt(metrics.mean_squared_error(y_test,pred_test)))
    
    # returning the list with train and test scores
    return score_list
In [60]:
dtree=DecisionTreeRegressor(random_state=1)
dtree.fit(X_train,y_train)
Out[60]:
DecisionTreeRegressor(random_state=1)
In [61]:
dtree_score=get_model_score(dtree)
R-square on training set :  1.0
R-square on test set :  0.2972166113750018
RMSE on training set :  0.0
RMSE on test set :  0.6974488520019424
  • Decision tree model with default parameters is overfitting the train data.
  • Let's see if we can reduce overfitting and improve performance on test data by tuning hyperparameters.

Hyperparameter Tuning

In [62]:
# Define the regressor to tune.
dtree_tuned = DecisionTreeRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': list(np.arange(2,20)) + [None], 
              'min_samples_leaf': [1, 3, 5, 7, 10],
              'max_leaf_nodes' : [2, 3, 5, 10, 15] + [None],
              'min_impurity_decrease': [0.001, 0.01, 0.1, 0.0]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
dtree_tuned.fit(X_train, y_train)
Out[62]:
DecisionTreeRegressor(max_depth=11, min_samples_leaf=3, random_state=1)
In [63]:
dtree_tuned_score=get_model_score(dtree_tuned)
R-square on training set :  0.6685458382995247
R-square on test set :  0.3275598777383334
RMSE on training set :  0.7405049074380436
RMSE on test set :  0.7978900181331905
  • Overfitting is reduced after hyperparameter tuning: train R-square dropped from 1.0 to ~0.67, and test R-square improved from ~0.30 to ~0.33 (about 3 percentage points).
  • Note, however, that the reported test RMSE rose from ~0.70 to ~0.80, so the tuned tree generalizes only modestly better than the default one.
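Before trusting a tuned model, it is worth inspecting `best_params_` and the cross-validated score the grid search found. A minimal, self-contained sketch of the same pattern on synthetic data (the grid values below are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: target depends mostly on the first feature
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)

# The chosen hyperparameters and the mean cross-validated R-square
print("Best parameters :", grid.best_params_)
print("Best CV R-square:", round(grid.best_score_, 3))
```

With `refit=True` (the default), `grid.best_estimator_` is already refit on the full training data, so the explicit `fit` call after `best_estimator_` in the cells above is redundant but harmless.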

Plotting the feature importance of each variable

In [64]:
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as the Gini importance).

print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                              Imp
Age                      0.535620
DurationOfPitch          0.431197
PitchSatisfactionScore   0.019955
CityTier                 0.004408
NumberOfFollowups        0.002802
CustomerID               0.001395
NumberOfPersonVisited    0.001157
MonthlyIncome            0.001009
NumberOfChildrenVisited  0.000938
Passport                 0.000580
NumberOfTrips            0.000365
PreferredPropertyStar    0.000322
ProdTaken                0.000129
OwnCar                   0.000123
In [65]:
feature_names = X_train.columns
importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
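The importances plotted above are normalized, so for any fitted tree they sum to 1 and can be compared across features directly. A quick check on synthetic data (illustrative, not the project dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data where only the first of four features carries signal
rng = np.random.RandomState(1)
X = rng.rand(100, 4)
y = 3.0 * X[:, 0] + rng.normal(scale=0.05, size=100)

tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X, y)
print("Importances       :", tree.feature_importances_)
print("Sum of importances:", tree.feature_importances_.sum())
```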
In [66]:
rf_estimator=RandomForestRegressor(random_state=1)
rf_estimator.fit(X_train,y_train)
Out[66]:
RandomForestRegressor(random_state=1)
In [67]:
rf_estimator_score=get_model_score(rf_estimator)
R-square on training set :  0.948786215923948
R-square on test set :  0.6333233563401046
RMSE on training set :  0.285887373240426
RMSE on test set :  0.47406804495498744
  • Random forest gives an R-square of ~63% on the test data, but with a train R-square of ~95% it is clearly overfitting.
  • Let's try to reduce this overfitting by hyperparameter tuning.

Hyperparameter Tuning

In [68]:
# Choose the type of estimator. 
rf_tuned = RandomForestRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {  
                'max_depth':[4, 6, 8, 10, None],
                'max_features': ['sqrt','log2',None],
                'n_estimators': [80, 90, 100, 110, 120]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
rf_tuned.fit(X_train, y_train)
Out[68]:
RandomForestRegressor(max_features='sqrt', n_estimators=120, random_state=1)
In [69]:
rf_tuned_score=get_model_score(rf_tuned)
R-square on training set :  0.9660018483379824
R-square on test set :  0.7634051833518913
RMSE on training set :  0.38198841378664705
RMSE on test set :  0.8321234144321212
  • Test R-square improved from ~0.63 to ~0.76 after tuning, so the tuned model generalizes better, though it is still overfitting the training data (train R-square ~0.97).
In [70]:
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as the Gini importance).

print(pd.DataFrame(rf_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                              Imp
Age                      0.385629
DurationOfPitch          0.363596
MonthlyIncome            0.088358
CustomerID               0.031689
NumberOfTrips            0.027898
PitchSatisfactionScore   0.022151
NumberOfFollowups        0.015053
CityTier                 0.011355
NumberOfChildrenVisited  0.010896
PreferredPropertyStar    0.010736
NumberOfPersonVisited    0.010290
ProdTaken                0.008552
Passport                 0.007343
OwnCar                   0.006453
In [71]:
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
  • Age and DurationOfPitch are by far the most important features for the tuned random forest model.

AdaBoost Regressor

In [244]:
X_ada = data[['Age']]
y_ada = data[['CityTier','DurationOfPitch','NumberOfPersonVisited','NumberOfFollowups','Passport','PitchSatisfactionScore','OwnCar','NumberOfChildrenVisited']]
# X_ada = data[['Occupation','Gender','ProductPitched','MaritalStatus','Designation']]
# y_ada = data[['ProdTaken','Age','CityTier','DurationOfPitch','NumberOfPersonVisited','NumberOfFollowups','Passport','PitchSatisfactionScore','OwnCar','NumberOfChildrenVisited']]
# Separating features and the target column
# X = data.drop(['PreferredLoginDevice','Occupation','Gender','ProductPitched','MaritalStatus','Designation'], axis=1)
# y = data[['ProdTaken','Age','CityTier','DurationOfPitch','NumberOfPersonVisited','NumberOfFollowups','Passport','PitchSatisfactionScore','OwnCar','NumberOfChildrenVisited']]
In [245]:
X_train_ada, X_test_ada, y_train_ada, y_test_ada = train_test_split(X_ada, y_ada, test_size=0.30, random_state=1, shuffle=True)
In [246]:
X_train_ada
Out[246]:
Age
3878 53.0
3933 39.0
3 33.0
4823 20.0
4230 50.0
... ...
2895 43.0
2763 33.0
905 29.0
3980 37.0
235 43.0

3421 rows × 1 columns

In [247]:
y_train_ada
Out[247]:
CityTier DurationOfPitch NumberOfPersonVisited NumberOfFollowups Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisited
3878 3 127.0 3 4.0 0 1 1 2.0
3933 1 9.0 3 4.0 1 1 0 1.0
3 1 9.0 2 3.0 1 5 1 1.0
4823 3 12.0 4 4.0 1 4 1 1.0
4230 1 7.0 3 5.0 1 3 0 1.0
... ... ... ... ... ... ... ... ...
2895 1 31.0 3 4.0 1 2 1 2.0
2763 3 15.0 4 5.0 1 2 1 1.0
905 1 6.0 2 4.0 0 2 0 1.0
3980 1 18.0 4 5.0 0 4 1 2.0
235 3 22.0 3 3.0 1 3 0 1.0

3421 rows × 8 columns

In [248]:
##  Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
    '''
    model : classifier to predict values of X

    '''
    # defining an empty list to store train and test results
    score_list=[] 
    
    pred_train_ada = model.predict(X_train_ada)
    pred_test_ada = model.predict(X_test_ada)
    
    train_acc_ada = model.score(X_train_ada,y_train_ada)
    test_acc_ada = model.score(X_test_ada,y_test_ada)
    
    train_recall_ada = metrics.recall_score(y_train_ada,pred_train_ada)
    test_recall_ada = metrics.recall_score(y_test_ada,pred_test_ada)
    
    train_precision_ada = metrics.precision_score(y_train_ada,pred_train_ada)
    test_precision_ada = metrics.precision_score(y_test_ada,pred_test_ada)
    
    score_list.extend((train_acc_ada,test_acc_ada,train_recall_ada,test_recall_ada,train_precision_ada,test_precision_ada))
        
    # If the flag is set to True, the following print statements are displayed; the default value is True
    if flag: 
        print("Accuracy on training set : ",model.score(X_train_ada,y_train_ada))
        print("Accuracy on test set : ",model.score(X_test_ada,y_test_ada))
        print("Recall on training set : ",metrics.recall_score(y_train_ada,pred_train_ada))
        print("Recall on test set : ",metrics.recall_score(y_test_ada,pred_test_ada))
        print("Precision on training set : ",metrics.precision_score(y_train_ada,pred_train_ada))
        print("Precision on test set : ",metrics.precision_score(y_test_ada,pred_test_ada))
    
    return score_list # returning the list with train and test scores
In [249]:
ab_regressor=AdaBoostRegressor(random_state=1)
ab_regressor.fit(X_train_ada,y_train_ada)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-249-01fd51a3da69> in <module>
      1 ab_regressor=AdaBoostRegressor(random_state=1)
----> 2 ab_regressor.fit(X_train_ada,y_train_ada)

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in fit(self, X, y, sample_weight)
   1005 
   1006         # Fit
-> 1007         return super().fit(X, y, sample_weight)
   1008 
   1009     def _validate_estimator(self):

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in fit(self, X, y, sample_weight)
    102             raise ValueError("learning_rate must be greater than zero")
    103 
--> 104         X, y = self._validate_data(X, y,
    105                                    accept_sparse=['csr', 'csc'],
    106                                    ensure_2d=True,

D:\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    805                         ensure_2d=False, dtype=None)
    806     else:
--> 807         y = column_or_1d(y, warn=True)
    808         _assert_all_finite(y)
    809     if y_numeric and y.dtype.kind == 'O':

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    843         return np.ravel(y)
    844 
--> 845     raise ValueError(
    846         "y should be a 1d array, "
    847         "got an array of shape {} instead.".format(shape))

ValueError: y should be a 1d array, got an array of shape (3421, 8) instead.
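The root cause of this traceback is that `y_train_ada` is an 8-column DataFrame, while scikit-learn regressors expect a single 1-D target. A minimal sketch on synthetic data of the shape that works, using `np.ravel` to flatten a single-column frame (column names here are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor

# Synthetic one-feature frame and a single-column target frame
rng = np.random.RandomState(1)
X = pd.DataFrame({"Age": rng.randint(18, 65, size=100)})
y = pd.DataFrame({"target": X["Age"] * 0.5 + rng.normal(size=100)})

ada = AdaBoostRegressor(random_state=1)
# y must be 1-D: ravel the single-column DataFrame before fitting
ada.fit(X, np.ravel(y))
print("Train R-square:", round(ada.score(X, np.ravel(y)), 3))
```

An 8-column `y` can never be raveled into a valid target; one target column has to be chosen (for this notebook, presumably the same `y` used by the earlier models).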
In [250]:
pred_train_ada = ab_regressor.predict(X_train_ada)
pred_test_ada = ab_regressor.predict(X_test_ada)
    
train_acc_ada = ab_regressor.score(X_train_ada,y_train_ada)
test_acc_ada = ab_regressor.score(X_test_ada,y_test_ada)
    
train_recall_ada = metrics.recall_score(y_train_ada,pred_train_ada)
# test_recall_ada = metrics.recall_score(y_test_ada,pred_test_ada)
    
#     train_precision_ada = metrics.precision_score(y_train_ada,pred_train_ada)
#     test_precision_ada = metrics.precision_score(y_test_ada,pred_test_ada)
    
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-250-63cb53a7d04b> in <module>
----> 1 pred_train_ada = ab_regressor.predict(X_train_ada)
      2 pred_test_ada = ab_regressor.predict(X_test_ada)
      3 
      4 train_acc_ada = ab_regressor.score(X_train_ada,y_train_ada)
      5 test_acc_ada = ab_regressor.score(X_test_ada,y_test_ada)

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in predict(self, X)
   1143             The predicted regression values.
   1144         """
-> 1145         check_is_fitted(self)
   1146         X = self._check_X(X)
   1147 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
   1017 
   1018     if not attrs:
-> 1019         raise NotFittedError(msg % {'name': type(estimator).__name__})
   1020 
   1021 

NotFittedError: This AdaBoostRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
In [211]:
ab_regressor_score=get_metrics_score(ab_regressor)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-211-6813d838ac33> in <module>
----> 1 ab_regressor_score=get_metrics_score(ab_regressor)

<ipython-input-209-743a61dca9b5> in get_metrics_score(model, flag)
     14     test_acc_ada = model.score(X_test_ada,y_test_ada)
     15 
---> 16     train_recall_ada = metrics.recall_score(y_train_ada,pred_train_ada)
     17     test_recall_ada = metrics.recall_score(y_test_ada,pred_test_ada)
     18 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py in recall_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
   1733     ``zero_division``.
   1734     """
-> 1735     _, r, _, _ = precision_recall_fscore_support(y_true, y_pred,
   1736                                                  labels=labels,
   1737                                                  pos_label=pos_label,

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
   1431     if beta < 0:
   1432         raise ValueError("beta should be >=0 in the F-beta score")
-> 1433     labels = _check_set_wise_labels(y_true, y_pred, average, labels,
   1434                                     pos_label)
   1435 

D:\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1248                          str(average_options))
   1249 
-> 1250     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1251     present_labels = unique_labels(y_true, y_pred)
   1252     if average == 'binary':

D:\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
     88 
     89     if len(y_type) > 1:
---> 90         raise ValueError("Classification metrics can't handle a mix of {0} "
     91                          "and {1} targets".format(type_true, type_pred))
     92 

ValueError: Classification metrics can't handle a mix of binary and continuous targets
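This error is expected: recall and precision are classification metrics, while `AdaBoostRegressor` produces continuous predictions, so the regression helper `get_model_score` (R-square/RMSE) is the appropriate one here rather than `get_metrics_score`. A small sketch of the distinction (synthetic values):

```python
import numpy as np
from sklearn import metrics

# Binary labels with binary predictions -> classification metrics apply
y_true = np.array([0, 1, 1, 0, 1])
y_pred_cls = np.array([0, 1, 0, 0, 1])
print("Recall  :", metrics.recall_score(y_true, y_pred_cls))

# Continuous predictions -> regression metrics apply
y_cont = np.array([0.1, 0.9, 0.7, 0.2, 0.8])
print("R-square:", metrics.r2_score(y_true, y_cont))
# metrics.recall_score(y_true, y_cont) would raise the ValueError seen above,
# because it mixes binary and continuous targets
```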
In [212]:
ab_regressor_score=get_model_score(ab_regressor)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-212-f967bbca19b5> in <module>
----> 1 ab_regressor_score=get_model_score(ab_regressor)

<ipython-input-59-a82c9e7246df> in get_model_score(model, flag)
      8     score_list=[]
      9 
---> 10     pred_train = model.predict(X_train)
     11     pred_test = model.predict(X_test)
     12 

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in predict(self, X)
   1146         X = self._check_X(X)
   1147 
-> 1148         return self._get_median_predict(X, len(self.estimators_))
   1149 
   1150     def staged_predict(self, X):

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in _get_median_predict(self, X, limit)
   1110     def _get_median_predict(self, X, limit):
   1111         # Evaluate predictions of all estimators
-> 1112         predictions = np.array([
   1113             est.predict(X) for est in self.estimators_[:limit]]).T
   1114 

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in <listcomp>(.0)
   1111         # Evaluate predictions of all estimators
   1112         predictions = np.array([
-> 1113             est.predict(X) for est in self.estimators_[:limit]]).T
   1114 
   1115         # Sort the predictions

D:\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in predict(self, X, check_input)
    425         """
    426         check_is_fitted(self)
--> 427         X = self._validate_X_predict(X, check_input)
    428         proba = self.tree_.predict(X)
    429         n_samples = X.shape[0]

D:\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    394         n_features = X.shape[1]
    395         if self.n_features_ != n_features:
--> 396             raise ValueError("Number of features of the model must "
    397                              "match the input. Model n_features is %s and "
    398                              "input n_features is %s "

ValueError: Number of features of the model must match the input. Model n_features is 1 and input n_features is 14 
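This is a second, independent problem: `get_model_score` predicts on the full 14-column `X_train`, but the model it receives was fitted (or attempted to be fitted) on the 1-column `X_train_ada`. A model must be scored on exactly the columns it was trained on; a minimal illustration with hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
full = pd.DataFrame(rng.rand(50, 3), columns=["a", "b", "c"])
y = full["a"] * 2.0

# Train on a single feature only
model = DecisionTreeRegressor(random_state=1).fit(full[["a"]], y)

caught = False
try:
    model.predict(full)               # 3 columns -> feature-count mismatch
except ValueError as exc:
    caught = True
    print("ValueError:", exc)

print(model.predict(full[["a"]])[:3])  # same single column -> works
```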

Hyperparameter Tuning

In [183]:
# Choose the type of estimator. 
ab_tuned = AdaBoostRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {'n_estimators': np.arange(10,100,10), 
              'learning_rate': [1, 0.1, 0.5, 0.01],
              }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
ab_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
ab_tuned.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-183-54a01998974a> in <module>
     12 # Run the grid search
     13 grid_obj = GridSearchCV(ab_tuned, parameters, scoring=scorer,cv=5)
---> 14 grid_obj = grid_obj.fit(X_train, y_train)
     15 
     16 # Set the clf to the best combination of parameters

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    763             refit_start_time = time.time()
    764             if y is not None:
--> 765                 self.best_estimator_.fit(X, y, **fit_params)
    766             else:
    767                 self.best_estimator_.fit(X, **fit_params)

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in fit(self, X, y, sample_weight)
   1005 
   1006         # Fit
-> 1007         return super().fit(X, y, sample_weight)
   1008 
   1009     def _validate_estimator(self):

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in fit(self, X, y, sample_weight)
    102             raise ValueError("learning_rate must be greater than zero")
    103 
--> 104         X, y = self._validate_data(X, y,
    105                                    accept_sparse=['csr', 'csc'],
    106                                    ensure_2d=True,

D:\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    805                         ensure_2d=False, dtype=None)
    806     else:
--> 807         y = column_or_1d(y, warn=True)
    808         _assert_all_finite(y)
    809     if y_numeric and y.dtype.kind == 'O':

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    843         return np.ravel(y)
    844 
--> 845     raise ValueError(
    846         "y should be a 1d array, "
    847         "got an array of shape {} instead.".format(shape))

ValueError: y should be a 1d array, got an array of shape (3421, 10) instead.
In [ ]:
ab_tuned_score=get_model_score(ab_tuned)
In [184]:
# importance of features in the tree building

print(pd.DataFrame(ab_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-184-db17e0b97634> in <module>
      1 # importance of features in the tree building
      2 
----> 3 print(pd.DataFrame(ab_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in feature_importances_(self)
    246             The feature importances.
    247         """
--> 248         if self.estimators_ is None or len(self.estimators_) == 0:
    249             raise ValueError("Estimator not fitted, "
    250                              "call `fit` before `feature_importances_`.")

AttributeError: 'AdaBoostRegressor' object has no attribute 'estimators_'
In [185]:
feature_names = X_train.columns
importances = ab_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-185-86bc0d56c4be> in <module>
      1 feature_names = X_train.columns
----> 2 importances = ab_tuned.feature_importances_
      3 indices = np.argsort(importances)
      4 
      5 plt.figure(figsize=(12,12))

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_weight_boosting.py in feature_importances_(self)
    246             The feature importances.
    247         """
--> 248         if self.estimators_ is None or len(self.estimators_) == 0:
    249             raise ValueError("Estimator not fitted, "
    250                              "call `fit` before `feature_importances_`.")

AttributeError: 'AdaBoostRegressor' object has no attribute 'estimators_'
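Assuming the intent of this section was to benchmark AdaBoost on the same train/test split as the other regressors, a working version of the fit-tune-score sequence would look like the sketch below (synthetic stand-ins for the project's `X_train`/`y_train`; the grid values are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the project features/target
rng = np.random.RandomState(1)
X = rng.rand(300, 5)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Default model
ab = AdaBoostRegressor(random_state=1).fit(X_train, y_train)
print("Default test R-square:", round(ab.score(X_test, y_test), 3))

# Hyperparameter tuning on the same (X_train, 1-D y_train) pair
grid = GridSearchCV(
    AdaBoostRegressor(random_state=1),
    {"n_estimators": [50, 100], "learning_rate": [1.0, 0.1]},
    scoring="r2",
    cv=3,
).fit(X_train, y_train)
ab_tuned = grid.best_estimator_

print("Tuned test R-square  :", round(ab_tuned.score(X_test, y_test), 3))
print("Importances sum      :", ab_tuned.feature_importances_.sum())
```

Once the estimator is actually fitted, `feature_importances_` is available, so the importance table and bar chart from the cells above would run without the `AttributeError`.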

Gradient Boosting Regressor

In [207]:
gb_estimator=GradientBoostingRegressor(random_state=1)
gb_estimator.fit(X_train_ada,y_train_ada)
Out[207]:
GradientBoostingRegressor(random_state=1)
In [208]:
gb_estimator_score=get_model_score(gb_estimator)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-208-6b6060b5573d> in <module>
----> 1 gb_estimator_score=get_model_score(gb_estimator)

<ipython-input-59-a82c9e7246df> in get_model_score(model, flag)
      8     score_list=[]
      9 
---> 10     pred_train = model.predict(X_train)
     11     pred_test = model.predict(X_test)
     12 

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py in predict(self, X)
   1609         X = check_array(X, dtype=DTYPE, order="C", accept_sparse='csr')
   1610         # In regression we can directly return the raw value from the trees.
-> 1611         return self._raw_predict(X).ravel()
   1612 
   1613     def staged_predict(self, X):

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py in _raw_predict(self, X)
    616     def _raw_predict(self, X):
    617         """Return the sum of the trees raw predictions (+ init estimator)."""
--> 618         raw_predictions = self._raw_predict_init(X)
    619         predict_stages(self.estimators_, X, self.learning_rate,
    620                        raw_predictions)

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py in _raw_predict_init(self, X)
    602         """Check input and compute raw predictions of the init estimator."""
    603         self._check_initialized()
--> 604         X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
    605         if X.shape[1] != self.n_features_:
    606             raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(

D:\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    394         n_features = X.shape[1]
    395         if self.n_features_ != n_features:
--> 396             raise ValueError("Number of features of the model must "
    397                              "match the input. Model n_features is %s and "
    398                              "input n_features is %s "

ValueError: Number of features of the model must match the input. Model n_features is 1 and input n_features is 14 

Hyperparameter Tuning

In [188]:
# Choose the type of estimator. 
gb_tuned = GradientBoostingRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {'n_estimators': np.arange(50,200,25), 
              'subsample':[0.7,0.8,0.9,1],
              'max_features':[0.7,0.8,0.9,1],
              'max_depth':[3,5,7,10]
              }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
gb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
gb_tuned.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-188-8132efdfdf8f> in <module>
     14 # Run the grid search
     15 grid_obj = GridSearchCV(gb_tuned, parameters, scoring=scorer,cv=5)
---> 16 grid_obj = grid_obj.fit(X_train, y_train)
     17 
     18 # Set the clf to the best combination of parameters

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    763             refit_start_time = time.time()
    764             if y is not None:
--> 765                 self.best_estimator_.fit(X, y, **fit_params)
    766             else:
    767                 self.best_estimator_.fit(X, **fit_params)

D:\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py in fit(self, X, y, sample_weight, monitor)
    415         sample_weight = _check_sample_weight(sample_weight, X)
    416 
--> 417         y = column_or_1d(y, warn=True)
    418         y = self._validate_y(y, sample_weight)
    419 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    843         return np.ravel(y)
    844 
--> 845     raise ValueError(
    846         "y should be a 1d array, "
    847         "got an array of shape {} instead.".format(shape))

ValueError: y should be a 1d array, got an array of shape (3421, 10) instead.
In [ ]:
gb_tuned_score=get_model_score(gb_tuned)
In [ ]:
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as the Gini importance).

print(pd.DataFrame(gb_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
In [ ]:
feature_names = X_train.columns
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

XGBoost Regressor

In [189]:
xgb_estimator = XGBRegressor(random_state=1)

# The original fit call failed DMatrix's label-shape assert with a bare
# AssertionError: XGBoost only accepts a 1-D (or single-column 2-D) label,
# so flatten the target before fitting
xgb_estimator.fit(X_train, y_train.values.ravel())
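The bare AssertionError above comes from DMatrix's label-shape check: the label must be 1-D (or a single-column 2-D array), so a wrongly shaped target trips the assert with no message. A minimal sketch of the shape fix with NumPy (toy array, illustrative only):

```python
import numpy as np

# a target stored as a 2-D array, as slicing a DataFrame can produce;
# XGBoost's DMatrix label check expects a flat 1-D array
y_2d = np.arange(6, dtype=float).reshape(-1, 1)
print(y_2d.shape)   # (6, 1)

# flatten before calling fit() so the label passes the shape check
y_1d = y_2d.ravel()
print(y_1d.shape)   # (6,)
```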
In [190]:
xgb_estimator_score=get_model_score(xgb_estimator)

Hyperparameter Tuning

In [191]:
# Choose the type of regressor
xgb_tuned = XGBRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {'n_estimators': [75, 100, 125, 150],
              'subsample': [0.7, 0.8, 0.9, 1],
              'gamma': [0, 1, 3, 5],
              'colsample_bytree': [0.7, 0.8, 0.9, 1],
              'colsample_bylevel': [0.7, 0.8, 0.9, 1]
              }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search (pass a 1-D target so the refit step
# does not hit the DMatrix label-shape assert)
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train.values.ravel())

# Set the regressor to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
xgb_tuned.fit(X_train, y_train.values.ravel())
In [ ]:
xgb_tuned_score=get_model_score(xgb_tuned)
In [ ]:
# importance of features in the tree building (for XGBoost, feature_importances_
# reflects the booster's importance_type -- typically the average gain of the splits
# that use the feature -- rather than the Gini importance used by sklearn trees)

print(pd.DataFrame(xgb_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
In [ ]:
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Now, let's build a stacking model with the tuned models - decision tree, random forest, and gradient boosting - and use an XGBoost regressor as the final estimator to produce the prediction.
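As a self-contained illustration of what StackingRegressor does - base learners are fit and their out-of-fold predictions become the training features for the final estimator - here is a toy sketch on synthetic data; the estimator choices and data are illustrative, not the notebook's tuned models:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# base learners produce out-of-fold predictions (cv=5); the final
# estimator is then trained on those predictions as its inputs
stack = StackingRegressor(
    estimators=[('dt', DecisionTreeRegressor(max_depth=3, random_state=1)),
                ('rf', RandomForestRegressor(n_estimators=20, random_state=1))],
    final_estimator=LinearRegression(),
    cv=5)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

The final estimator never sees predictions a base learner made on its own training fold, which is what keeps stacking from simply memorizing the base models' training-set fit.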

In [ ]:
estimators=[('Decision Tree', dtree_tuned),('Random Forest', rf_tuned),
           ('Gradient Boosting', gb_tuned)]
final_estimator=XGBRegressor(random_state=1)
In [ ]:
stacking_estimator=StackingRegressor(estimators=estimators, final_estimator=final_estimator,cv=5)
# pass a 1-D target so the XGBoost final estimator's fit does not
# hit the DMatrix label-shape assert
stacking_estimator.fit(X_train, y_train.values.ravel())
In [ ]:
stacking_estimator_score=get_model_score(stacking_estimator)

Comparing all models

In [ ]:
# defining list of models
models = [dtree, dtree_tuned, rf_estimator, rf_tuned, ab_regressor, ab_tuned, gb_estimator, gb_tuned, xgb_estimator,
         xgb_tuned, stacking_estimator]

# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []

# looping through all the models to get the rmse and r2 scores
for model in models:
    # r2 and RMSE scores
    j = get_model_score(model,False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])
In [ ]:
comparison_frame = pd.DataFrame({'Model':['Decision Tree','Tuned Decision Tree','Random Forest','Tuned Random Forest',
                                          'AdaBoost Regressor', 'Tuned AdaBoost Regressor',
                                          'Gradient Boosting Regressor', 'Tuned Gradient Boosting Regressor',
                                          'XGBoost Regressor',  'Tuned XGBoost Regressor','Stacking Regressor'], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE':rmse_train,'Test_RMSE':rmse_test}) 
comparison_frame
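Once `comparison_frame` is built, sorting it by test R² makes the best generalizer easy to read off; a minimal sketch with hypothetical scores (the model names and numbers below are placeholders, not the notebook's actual results):

```python
import pandas as pd

# hypothetical scores standing in for the real r2_test values
comparison_frame = pd.DataFrame({
    'Model': ['Decision Tree', 'Tuned Gradient Boosting', 'Stacking Regressor'],
    'Test_r2': [0.61, 0.78, 0.75],
})

# sort by test R^2, descending, so the strongest generalizer sits on top
best = comparison_frame.sort_values('Test_r2', ascending=False)
print(best.iloc[0]['Model'])   # -> Tuned Gradient Boosting
```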
In [ ]:
# Plot observed vs predicted values on the test data for the best model, i.e. the tuned gradient boosting model
fig, ax = plt.subplots(figsize=(8, 6))
y_pred=gb_tuned.predict(X_test)
ax.scatter(y_test, y_pred, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
ax.set_title("Observed vs Predicted")
plt.grid()
plt.show()
In [ ]: